## library(devtools)
## library(BiocManager)
library(SingleCellExperiment)
library(ggplot2)
library(scFeatures) ## devtools::install_github("SydneyBioX/scFeatures")
library(ClassifyR) ## BiocManager::install("ClassifyR", dependencies = TRUE)
library(ggthemes)
library(spicyR) ## BiocManager::install("spicyR")
library(dplyr)
library(limma)
library(plotly)
library(scattermore)
library(tidyr)
library(survival)
library(survminer)
library(spatstat)
##library(spatstat.core) ## install.packages("spatstat")
##library(spatstat.geom)
library(scater)
library(scran)
theme_set(theme_classic())
As single cell technology advances, the recent development of spatial omics allows us to examine the spatial organisation of cells within tissues in their native environment. This workshop will discuss the challenges and analytical focus associated with disease outcome prediction using multi-sample spatial datasets. We will also talk about general analytical strategies and the critical thinking questions that arise in the workflow.
sessionInfo at the end of this document to ensure you are
using the correct version.Structure for this 2-hour workshop:
| Activity | Time |
|---|---|
| Introduction to spatial technologies | 20m |
| Cell segmentation with deep learning (with BIDCell) | 20m |
| Exploring spatial data | 20m |
| Break (Q&A) | 10m |
| Extracting features from spatial data (with scFeatures) | 30m |
| Performing disease outcome classification (with ClassifyR) | 20m |
The widely-known METABRIC breast cancer cohort has recently had imaging mass cytometry generated for a subset of it. The publication describing this data is Imaging Mass Cytometry and Multiplatform Genomics Define the Phenogenomic Landscape of Breast Cancer, Nature Cancer, 2020. IMC has cell-level resolution. There are 483 cancer sample with IMC data. However, the subset of interest is those patients who do not have lymph node metastasis. Can their risk of recurrence accurately be predicted and therefore inform how aggressively they need to be treated? The other component of the analysis is patient clinical data, which has been sourced from Supplementary Table 5 of Dynamics of Breast-cancer Relapse Reveal Late-recurring ER-positive Genomic Subgroups, Nature, 2019.
Basic characteristics of the data objects:
load("~/data/breastCancer.RData")
data_sce <- IMC
print("data format")
## [1] "data format"
assay(data_sce)[1:5, 1:5]
## MB-0002:345:93 MB-0002:345:107 MB-0002:345:113 MB-0002:345:114
## HH3_total 4.82032390 3.8229411 2.55845170 4.82208870
## CK19 1.21185160 1.3223530 0.13832258 0.33366668
## CK8_18 2.80274720 2.4720588 0.60545164 2.43351100
## Twist 0.14651649 0.1176471 0.09877419 0.20791112
## CD68 0.07863736 0.1016471 0.03225806 0.05237778
## MB-0002:345:125
## HH3_total 2.41391110
## CK19 0.07055555
## CK8_18 0.20724444
## Twist 0.12004445
## CD68 0.00000000
The original data has been restricted to images with at least 400 cells, no lymph node cancer and Stage 1.
In this demo, we will fit survival models and evaluate the features selected to build them.
At the start of the exploration, it is often good to get a sense of the complexity of the data. We usually do this by exploring the distribution of the outcomes and various variables in the patient’s meta-data. Here, we use cross-tabulation to examine the following variables:
print("Stage and death")
## [1] "Stage and death"
table(clinical$Breast.Surgery, clinical$Death, useNA = "ifany")
##
## 0 1
## BREAST CONSERVING 38 14
## MASTECTOMY 16 6
## <NA> 2 1
print("Number of patients based on ER status")
## [1] "Number of patients based on ER status"
table(clinical$ER.Status)
##
## neg pos
## 12 64
print("Number of patients based on Grade")
## [1] "Number of patients based on Grade"
table(clinical$Grade)
##
## 1 2 3
## 14 30 28
Typically in single-cell data analysis, we perform dimension reduction to project the high dimensional cell by gene matrix on to 2D space. This allows us to visualise various things of interest, such as distribution of cell types and disease outcomes. In this dataset, cells were classified into 22 cell types based on their markers.
Note: for single-cell RNA-seq with around 20,000 genes, we often need to perform some filtering (e.g. select highly variable genes) to reduce the number of features. Here, given we have less than 50 proteins, there is no need to pre-filter. That being said, we provide some sample code below (commented out) to demonstrate how to identify highly variable genes followed by UMAP visualisation in scRNA-seq data.
# gene_var <- modelGeneVar(data_sce)
# hvgs <- getTopHVGs(gene_var, prop=0.1)
# data_sce <- runUMAP(data_sce, scale=TRUE, subset_row = hvgs)
data_sce <- runUMAP(data_sce, scale=TRUE)
a <- plotUMAP(data_sce, colour_by = "description")
b <- plotUMAP(data_sce, colour_by = "metabricId")
a + b
Interactive Q&A:
What can we learn from these illustrations? Is there anything interesting in the plot? Questions to consider include:
The advantage with spatial omics is that we can examine the organisation of the cell types as it occurs on the tissue slide. Here, we visualise one of the slides from a patient. As an optional exercise, you can
to give a sense of what is random ordering.
one_sample <- data_sce[, data_sce$metabricId == "MB-0263"]
one_sample <- colData(one_sample)
one_sample <- data.frame(one_sample)
tableau_palette <- scale_colour_tableau()
color_codes <- tableau_palette$palette(10)
a <- ggplot(one_sample, aes(x = Location_Center_X , y = Location_Center_Y, colour = description)) + geom_point(alpha=0.7) + scale_colour_manual(values = c(color_codes, "lightgrey")) + ggtitle("Original slide")
print("Optional: Permute the cell type labels")
## [1] "Optional: Permute the cell type labels"
one_sample$description_permute <- sample(one_sample$description)
b <- ggplot(one_sample, aes(x = Location_Center_X , y = Location_Center_Y, colour =description_permute)) + geom_point(alpha=0.7) + scale_colour_manual(values = c(color_codes, "lightgrey")) + ggtitle("Permute the cell type label")
print("Optional: Permute the spatial coordinate")
## [1] "Optional: Permute the spatial coordinate"
one_sample$Location_Center_X_permute <- sample(one_sample$Location_Center_X)
one_sample$Location_Center_Y_permute <- sample(one_sample$Location_Center_Y)
c <- ggplot(one_sample, aes(x = Location_Center_X_permute , y = Location_Center_Y_permute, colour = description)) + geom_point(alpha=0.7) + scale_colour_manual(values = c(color_codes, "lightgrey")) + ggtitle("Permute the X, Y coordinate")
a + b + c
Critical thinking:
We begin by examining how can we identify and visualise regions of tissue where spatial associations between cell-types are similar. There are many packages that perform this task andhere we use the lisaClust function that is based on “local L-function” to spatially cluster cells into different regions with similar cell type composition.
This has already been pre-loaded for you. Please proceed to the next task.
set.seed(51773)
BPPARAM <- MulticoreParam(16)
# Cluster cells into spatial regions with similar composition.
data_sce <- lisaClust(
data_sce ,
k = 5,
Rs = c(20, 50, 100),
sigma = 50,
spatialCoords = c("Location_Center_X", "Location_Center_Y"),
cellType = "description",
imageID = "ImageNumber" ,
regionName = "region",
BPPARAM = BPPARAM
)
metadata <- colData(data_sce)
metadata <- metadata[metadata$metabricId == "MB-0263", ]
metadata <- data.frame(metadata)
plotlist <- list()
plotlist_celltype <- list()
thisregion <- unique(metadata$region)[1]
tableau_palette <- scale_colour_tableau()
color_codes <- tableau_palette$palette(10)
color_codes <- c(color_codes, "darkgrey")
names(color_codes) <- c(unique(metadata$description) , "other regions")
for ( thisregion in sort(unique(metadata$region))){
selected_region_index <- metadata$region == thisregion
metadata$selected_region <- "other regions"
metadata$selected_region[selected_region_index] <- "selected region"
metadata$celltype <- metadata$description
metadata$celltype[!selected_region_index ] <- "other regions"
metadata$celltype <- factor(metadata$celltype, levels = c(unique(metadata$description), "other regions"))
p <- ggplot(metadata, aes(x = Location_Center_X , y = Location_Center_Y , colour = selected_region )) + geom_scattermore(pointsize = 1.5) + ggtitle(thisregion) + scale_colour_manual(values = c("grey" , "red"))
p2 <- ggplot(metadata, aes(x = Location_Center_X , y = Location_Center_Y , colour = celltype )) + geom_scattermore(pointsize = 1.5) + ggtitle(thisregion) + scale_colour_manual(values = color_codes)
plotlist [[thisregion ]] <- p
plotlist_celltype [[thisregion ]] <- p2
}
ggarrange(plotlist = plotlist , ncol = 5, nrow = 1 , common.legend = T )
ggarrange(plotlist = plotlist_celltype , ncol = 5, nrow = 1 , common.legend = T )
We can better interpret the region output by summarising the proportion of each cell type in a region across the patients. Looking at the composition of cell types in each region, comparing between Young and Old Patients. A cutoff of 50 years old will be used.
df <- data.frame(colData( data_sce))
clinical$AgeGroup <- ifelse(clinical$Age.At.Diagnosis < 50, "Young", "Old")
df_plot <- NULL
for ( thispatient in unique(df$metabricId)){
this_df <- df[df$metabricId == thispatient, ]
temp_df <- table( this_df$region, this_df$description )
temp_df <- temp_df / rowSums(temp_df)
temp_df <- data.frame( temp_df)
temp_df$patient <- thispatient
temp_df$group <- clinical[clinical$metabricId == thispatient, "AgeGroup"]
df_plot <- rbind(df_plot, temp_df)
}
df_plot <- df_plot %>% dplyr::group_by( Var1 , Var2, group) %>%
summarise(mean_proportion = mean(Freq))
# r1 <- df_plot[ df_plot$Var1 == "region_1", ]
ggplot(df_plot, aes(y = Var2, x = Var1 ,colour =mean_proportion , size = mean_proportion ))+ geom_point() +
facet_grid(~group, scales = "free", space = "free" ) +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1)) +
xlab("Region" ) + ylab("Celltype") + scale_colour_viridis_c()
Interactive Q&A:
Q4: Which regions appear to be different between Young and Old?
df <- clinical[colData(data_sce)[, "metabricId"], ]
df$region <- data_sce$region
df$description <- data_sce$description
df <- df %>% dplyr::group_by(metabricId , AgeGroup, region) %>%
count(description) %>%
mutate(proportion = n / sum(n))
ggplot(df, aes(y = proportion, x = metabricId , fill = description))+ geom_col()+facet_grid(~region+AgeGroup, scales = "free", space = "free" ) + scale_fill_manual(values = c(color_codes)) +
theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank())
Interactive Q&A:
Q5: Does your conclusion change after looking at a different plot?
We see that region 1 appears to suggest the non-responder patients have more melano. Region 3 appears to be the tumor microenvironment with lots of Th.ae (antigen-experienced) and macro.mono (macrophage and monocytes) cell types. Let’s focus on region 1 and region 3 and visualise the data with boxplots, as well as comparing to the overall cell type proportion without segmenting into regions.
df_plot <- NULL
for ( thispatient in unique(df$metabricId)){
this_df <- df[df$metabricId == thispatient, ]
temp_df <- table( this_df$region, this_df$description )
temp_df <- temp_df / rowSums(temp_df)
temp_df <- data.frame( temp_df)
temp_df$patient <- thispatient
temp_df$group <- unique( this_df$AgeGroup)
df_plot <- rbind(df_plot, temp_df)
}
df_plot_region_1 <- df_plot[df_plot$Var1 == "region_1", ]
a <- ggplot(df_plot_region_1, aes(x = Var2, y = Freq, colour = group)) +
geom_boxplot()+
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1)) +
ylab("Proportion") + xlab("Cell type")+ ggtitle("Region 1") + ylim(0, 0.25)
df_plot_region_3 <- df_plot[df_plot$Var1 == "region_3", ]
b <- ggplot(df_plot_region_3, aes(x = Var2, y = Freq, colour = group)) +
geom_boxplot()+
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1)) +
ylab("Proportion") + xlab("Cell type") + ggtitle("Region 3") + ylim(0, 0.25)
overall <- NULL
for ( thispatient in unique(df$metabricId)){
this_df <- df[df$metabricId == thispatient, ]
temp_df <- table( this_df$description )
temp_df <- temp_df /sum(temp_df)
temp_df <- data.frame( temp_df)
temp_df$patient <- thispatient
temp_df$group <- unique( this_df$AgeGroup )
overall <- rbind(overall, temp_df)
}
c <- ggplot(overall, aes(x = Var1, y = Freq, colour = group)) +
geom_boxplot()+
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1)) +
ylab("Proportion") + xlab("Cell type") + ggtitle("Overall composition")
a + b + c
Often you may have a marker in mind to further examine the expression of key marker genes in these region specific cell types. For example, we select cells that have high Ki67 expression. (ie, only keeping the cells that have Ki67 expression higher than the median Ki67 expression in the whole dataset). We choose Ki67 as an example here because Ki67 is strongly associated with tumor cell proliferation and growth and is widely used as a biomarker in cancer analysis.
median_ki67 <- median( logcounts(data_sce[ "Ki67" , ]))
data_sce$ki67 <- ifelse( logcounts(data_sce[ "Ki67" , ]) > median_ki67, "high_ki67" , "low_ki67")
df_plot <- NULL
for ( thispatient in unique(df$metabricId)){
this_df <- df[df$metabricId == thispatient, ]
temp_df <- table( this_df$region, this_df$description )
temp_df <- temp_df / rowSums(temp_df)
temp_df <- data.frame( temp_df)
temp_df$patient <- thispatient
temp_df$group <- unique( this_df$AgeGroup )
df_plot <- rbind(df_plot, temp_df)
}
df_plot_region_1 <- df_plot[df_plot$Var1 == "region_1", ]
a <- ggplot(df_plot_region_1, aes(x = Var2, y = Freq, colour = group)) +
geom_boxplot()+
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1)) +
ylab("Proportion") + xlab("Cell type")+ ggtitle("Region 1") + ylim(0, 0.25)
df_plot_region_3 <- df_plot[df_plot$Var1 == "region_3", ]
b <- ggplot(df_plot_region_3, aes(x = Var2, y = Freq, colour = group)) +
geom_boxplot()+
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1)) +
ylab("Proportion") + xlab("Cell type") + ggtitle("Region 3") + ylim(0, 0.25)
overall <- NULL
for ( thispatient in unique(df$metabricId)){
this_df <- df[df$metabricId == thispatient, ]
temp_df <- table( this_df$description )
temp_df <- temp_df /sum(temp_df)
temp_df <- data.frame( temp_df)
temp_df$patient <- thispatient
temp_df$group <- unique( this_df$AgeGroup )
overall <- rbind(overall, temp_df)
}
c <- ggplot(overall, aes(x = Var1, y = Freq, colour = group)) +
geom_boxplot()+
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1)) +
ylab("Proportion") + xlab("Cell type") + ggtitle("Overall composition") + ylim(0, 0.25)
a + b + c
Discussion:
Comparing the overall composition and the cell type composition in the region, what did you observe about the regions?
In this demo, we use scFeatures to generate molecular representation for each patient from the matrix of proteins by cells. The molecular representation is interpretable and hence facilitates downstream analysis of the patient. Overall, scFeatures generates features across six categories representing different molecular views of cellular characteristics. These include:
The different types of features constructed enable a more comprehensive multi-view understanding of each patient from a matrix of proteins x cells.
Given in the previous section, we clustered the cells into regions, we can use the region information as an additional layer of information on top of the cell types to examine region-specific cell-type specific features.
region <- data_sce$region
region <- gsub("_" , "", region)
data_sce$celltype <- paste0( data_sce$description , "-" , region)
print("number of cells in each sample")
## [1] "number of cells in each sample"
table(data_sce$metabricId)
##
## MB-0002 MB-0064 MB-0128 MB-0130 MB-0132 MB-0136 MB-0138 MB-0142 MB-0145 MB-0154
## 1111 1284 1005 870 1115 1648 509 744 982 338
## MB-0168 MB-0175 MB-0180 MB-0192 MB-0201 MB-0208 MB-0225 MB-0227 MB-0232 MB-0240
## 819 1002 1247 808 580 625 1036 1357 847 1026
## MB-0245 MB-0246 MB-0249 MB-0252 MB-0255 MB-0256 MB-0258 MB-0260 MB-0263 MB-0275
## 1282 859 474 794 932 801 499 1016 755 418
## MB-0308 MB-0318 MB-0320 MB-0347 MB-0351 MB-0367 MB-0377 MB-0383 MB-0394 MB-0397
## 1116 1380 847 789 803 552 1026 773 1201 1275
## MB-0400 MB-0401 MB-0405 MB-0410 MB-0420 MB-0439 MB-0451 MB-0454 MB-0455 MB-0461
## 1138 923 819 1138 532 1307 913 1195 605 1630
## MB-0463 MB-0467 MB-0470 MB-0475 MB-0480 MB-0481 MB-0487 MB-0498 MB-0508 MB-0516
## 1592 886 865 1742 593 1685 967 1338 1226 1086
## MB-0518 MB-0520 MB-0531 MB-0536 MB-0571 MB-0584 MB-0591 MB-0596 MB-0604 MB-0605
## 662 1393 607 1023 681 1078 940 1679 1182 634
## MB-0628 MB-0635 MB-0636 MB-0642 MB-0661 MB-0662 MB-0663
## 1730 836 524 1255 1553 1238 567
print("number of cells in each celltype - Region specific cell type")
## [1] "number of cells in each celltype - Region specific cell type"
table(data_sce$celltype)
##
## B cells-region1 B cells-region2
## 5 111
## B cells-region3 B cells-region4
## 11 157
## B cells-region5 Basal CKlow-region1
## 13 125
## Basal CKlow-region2 Basal CKlow-region3
## 132 104
## Basal CKlow-region4 Basal CKlow-region5
## 574 28
## Endothelial-region1 Endothelial-region2
## 20 253
## Endothelial-region3 Endothelial-region4
## 24 238
## Endothelial-region5 Fibroblasts CD68+-region1
## 54 146
## Fibroblasts CD68+-region2 Fibroblasts CD68+-region3
## 829 42
## Fibroblasts CD68+-region4 Fibroblasts CD68+-region5
## 843 80
## Fibroblasts-region1 Fibroblasts-region2
## 842 3191
## Fibroblasts-region3 Fibroblasts-region4
## 296 3894
## Fibroblasts-region5 HER2+-region1
## 400 149
## HER2+-region2 HER2+-region3
## 97 26
## HER2+-region4 HER2+-region5
## 230 42
## HR- CK7--region1 HR- CK7--region2
## 1924 602
## HR- CK7--region3 HR- CK7--region4
## 667 2557
## HR- CK7--region5 HR- CK7+-region1
## 824 506
## HR- CK7+-region2 HR- CK7+-region3
## 443 2823
## HR- CK7+-region4 HR- CK7+-region5
## 2635 863
## HR- CKlow CK5+-region1 HR- CKlow CK5+-region2
## 50 128
## HR- CKlow CK5+-region3 HR- CKlow CK5+-region4
## 6 155
## HR- CKlow CK5+-region5 HR- Ki67+-region1
## 7 117
## HR- Ki67+-region2 HR- Ki67+-region3
## 148 61
## HR- Ki67+-region4 HR- Ki67+-region5
## 955 30
## HR+ CK7- Ki67+-region1 HR+ CK7- Ki67+-region2
## 293 158
## HR+ CK7- Ki67+-region3 HR+ CK7- Ki67+-region4
## 280 1128
## HR+ CK7- Ki67+-region5 HR+ CK7- Slug+-region1
## 453 58
## HR+ CK7- Slug+-region2 HR+ CK7- Slug+-region3
## 39 28
## HR+ CK7- Slug+-region4 HR+ CK7- Slug+-region5
## 789 147
## HR+ CK7--region1 HR+ CK7--region2
## 1652 1719
## HR+ CK7--region3 HR+ CK7--region4
## 1775 9299
## HR+ CK7--region5 HRlow CKlow-region1
## 8606 1742
## HRlow CKlow-region2 HRlow CKlow-region3
## 353 86
## HRlow CKlow-region4 HRlow CKlow-region5
## 791 271
## Hypoxia-region1 Hypoxia-region2
## 77 165
## Hypoxia-region3 Hypoxia-region4
## 64 616
## Hypoxia-region5 Macrophages Vim+ CD45low-region1
## 23 17
## Macrophages Vim+ CD45low-region2 Macrophages Vim+ CD45low-region3
## 230 19
## Macrophages Vim+ CD45low-region4 Macrophages Vim+ CD45low-region5
## 275 38
## Macrophages Vim+ Slug--region1 Macrophages Vim+ Slug--region2
## 15 200
## Macrophages Vim+ Slug--region3 Macrophages Vim+ Slug--region4
## 16 284
## Macrophages Vim+ Slug--region5 Macrophages Vim+ Slug+-region1
## 26 49
## Macrophages Vim+ Slug+-region2 Macrophages Vim+ Slug+-region3
## 140 37
## Macrophages Vim+ Slug+-region4 Macrophages Vim+ Slug+-region5
## 372 88
## Myoepithelial-region1 Myoepithelial-region2
## 18 262
## Myoepithelial-region3 Myoepithelial-region4
## 261 700
## Myoepithelial-region5 Myofibroblasts-region1
## 150 714
## Myofibroblasts-region2 Myofibroblasts-region3
## 4035 510
## Myofibroblasts-region4 Myofibroblasts-region5
## 4774 722
## T cells-region1 T cells-region2
## 72 1161
## T cells-region3 T cells-region4
## 117 1138
## T cells-region5 Vascular SMA+-region1
## 99 15
## Vascular SMA+-region2 Vascular SMA+-region3
## 256 27
## Vascular SMA+-region4 Vascular SMA+-region5
## 363 38
Discussion:
Are there any samples or cell types you would like to remove from the data?
All the feature types can be generated in one line of code. This runs
the function using default settings for all parameters. For more
information, type ?scFeatures. To save time, just
load the prepared RDS file in the following chunk.
# scFeatures requires the following columns
# celltype, sample, x_cord and y_cord
# alternatively, these can be also set as argument in the scFeatures function
IMCmatrix <- assay(data_sce)
sample = data_sce$metabricId
celltype = data_sce$celltype
spatialCoords <- list(colData(data_sce)$Location_Center_X, colData(data_sce)$Location_Center_Y)
# here, we specify that this is a spatial proteomics data
# scFeatures support parallel computation to speed up the process
scfeatures_result <- scFeatures(IMCmatrix, type = "spatial_p", sample = sample, celltype = celltype, spatialCoords = spatialCoords,
ncores = 32)
scfeatures_result <- readRDS("~/data/scfeatures_result.RDS")
We have generated a total of 13 feature types and stored them in a
list. All generated feature types are stored in a matrix of samples by
features. For example, the first list element contains the feature type
proportion_raw, which contains the cell type proportion
features for each patient sample. We could print out the first 5 columns
and first 5 rows of the first element to see.
# we have generated a total of 13 feature types
names(scfeatures_result)
## [1] "proportion_raw" "proportion_logit" "proportion_ratio"
## [4] "gene_mean_celltype" "gene_prop_celltype" "gene_cor_celltype"
## [7] "gene_mean_bulk" "gene_prop_bulk" "gene_cor_bulk"
## [10] "L_stats" "celltype_interaction" "morans_I"
## [13] "nn_correlation"
# each row is a sample, each column is a feature
data.frame(scfeatures_result[[1]][1:5, 1:5])
## B.cells.region1 B.cells.region2 B.cells.region3 B.cells.region4
## MB-0002 0.0000000000 0.000000000 0 0.00000000
## MB-0064 0.0000000000 0.000000000 0 0.00000000
## MB-0128 0.0009950249 0.044776119 0 0.06666667
## MB-0130 0.0000000000 0.000000000 0 0.00000000
## MB-0132 0.0000000000 0.007174888 0 0.00000000
## B.cells.region5
## MB-0002 0.000000000
## MB-0064 0.000000000
## MB-0128 0.007960199
## MB-0130 0.000000000
## MB-0132 0.000000000
## DT::datatable(meta_table , options = list(scrollX = TRUE))
In this section, we will perform disease outcome classification using the molecular representation of patients. Recall in the previous section that we have stored the 13 feature types matrix in a list. Instead of manually retrieving each matrix from the list to build separate models, classifyR can directly take a list of matrices as an input and run a repeated cross-validation model on each matrix individually. Below, we run 5 repeats of 5-fold cross-validation. The patient outcome is time-to-event, so, by default, ClassifyR will use Cox proportional hazards ranking to choose a set of features and also Cox proportional hazards to predict risk scores. Other models are available. A high score indicates prection of a worse outcome than a lower risk score.
usefulFeatures <- c("Breast.Tumour.Laterality", "ER.Status", "Inferred.Menopausal.State", "Grade", "Size")
nFeatures <- append(list(clinical = 1:3), lapply(scfeatures_result, function(metaFeature) 1:5))
clinicalAndOmics <- append(list(clinical = clinical), scfeatures_result)
### generate classfyr result
classifyr_result <- crossValidate(clinicalAndOmics, c("timeRFS", "eventRFS"),
extraParams = list(prepare = list(useFeatures = list(clinical = usefulFeatures))),
nFeatures = nFeatures, nFolds = 5, nRepeats = 5, nCores = 5)
Cox proportional hazards is a traditional statistical method. Compare the predictive performance with a machine learning method. Random survival forest will be used. Such models can build remarkably complex relationships between features. Note, however, the running time is much longer than Cox proportional hazards and feature selection is used to limit the number of features considered to 100 at most per metafeature. To save time, just load the prepared RDS file.
nFeatures <- append(list(clinical = 1:3), lapply(scfeatures_result[2:length(scfeatures_result)], function(metaFeature) min(100, ncol(metaFeature))))
survForestCV <- crossValidate(clinicalAndOmics, outcome, nFeatures = nFeatures,
classifier = "randomForest",
nFolds = 5, nRepeats = 5, nCores = 5)
survForestCV <- readRDS("~/data/survForestCV.RDS")
To examine the distribution of prediction performance. use
performancePlot. Currently, the only metric for
time-to-event data is C-index and that will automatically be used
because the predictive model type is tracked inside of the result
objects.
tilt <- theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))
performancePlot(append(classifyr_result, survForestCV),
characteristicsList = list(x = "Assay Name", row = "Classifier Name"),
orderingList = list("Assay Name" = names(scfeatures_result))) + tilt
Note how the resultant plot is a ggplot2 object and can be further modified.
The same code could be used for a categorical classifier because the random forest implementation provided by ranger package has the same interface for both.
Next, examine feature selection stability with
selectionPlot.
selectionPlot(append(classifyr_result, survForestCV),
characteristicsList = list(x = "Assay Name", row = "Classifier Name"),
orderingList = list("Assay Name" = names(scfeatures_result))) + tilt
distribution(classifyr_result[[1]], plot = FALSE)
## assay feature proportion
## 1 clinical Inferred.Menopausal.State 1
Using samplesMetricMap compare the per-sample C-index
for Cox and Random Forest models for metafeature gene_cor_celltype.
library(grid)
heatmap <- samplesMetricMap(list(classifyr_result[[7]], survForestCV[[7]]))
grid.draw(heatmap)
A few samples are predicted better by one model than another.
The per-sample C-index is a metric unique to ClassifyR. Models and
feature selection approaches may be seen in the vignette or listed by
available().
Interactive Q&A:
Q8: Is highest predictive performance the only way to choose the best model or are other models the best for other reasons?
The L function is a spatial statistic used to assess the spatial distribution of cell types. It assesses the significance of cell-cell interactions, by calculating the density of a cell type with other cell types within a certain radius. High values indicate spatial association, low values indicate spatial avoidance.
one_sample <- data_sce[ , data_sce$metabricId == "MB-0128" ]
one_sample <- data.frame( colData(one_sample) )
one_sample$celltype <- one_sample$description
index <- one_sample$celltype %in% c("B cells", "Fibroblasts")
one_sample$celltype[!index] <- "others"
a <-ggplot( one_sample, aes(x = Location_Center_X , y = Location_Center_Y, colour = celltype )) + geom_point() + scale_colour_manual(values = color_codes) + ggtitle( "Patient MB-0128 - high L value with \n B cells interacting Fibroblasts")
one_sample$celltype <- one_sample$description
index <- one_sample$celltype %in% c("melano", "Tc.ae")
one_sample$celltype[!index] <- "others"
b <- ggplot( one_sample, aes(x = Location_Center_X , y = Location_Center_Y, colour = celltype )) + geom_point() + scale_colour_manual(values = color_codes) + ggtitle( "Patient MB-0128 - low L value with \n B cells interacting HR_ CK7")
a + b
We calculate the nearest neighbours of each cell and then calculate the pairs of cell types based on the nearest neighbour. This allows us to summarise it into a cell type interaction composition.
one_sample <- data_sce[ , data_sce$metabricId == "MB-0263" ]
one_sample <- data.frame( colData(one_sample) )
one_sample$celltype <- one_sample$description
a <-ggplot( one_sample, aes(x = Location_Center_X , y = Location_Center_Y, colour = celltype )) + geom_point() + scale_colour_manual(values = color_codes) + ggtitle( "Patient MB-0263")
feature <- scfeatures_result$celltype_interaction
to_plot <- data.frame( t( feature["MB-0263", ]) )
to_plot$feature <- rownames(to_plot)
colnames(to_plot)[2] <- "celltype interaction composition"
to_plot <- to_plot[ to_plot$MB.0263 > 0.03 , ]
b <- ggplot(to_plot, aes(x = reorder(`celltype interaction composition`, MB.0263) , y = MB.0263, fill=`celltype interaction composition`)) + geom_bar(stat="identity" ) + ylab("Major cell type interactions") +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1))
a+ b
Moran’s I is a spatial autocorrelation statistic based on both location and values. It quantifies whether similar values tend to occur near each other or are dispersed.
high <- data_sce["Ki67", data_sce$metabricId == "MB-0132" ]
high_meta <- data.frame( colData(high) )
high_meta$expression <- as.vector(logcounts( high))
low <- data_sce["Ki67", data_sce$metabricId == "MB-0249" ]
low_meta <- data.frame( colData(low) )
low_meta$expression <- as.vector(logcounts(low))
a <- ggplot(high_meta, aes(x = Location_Center_X , y = Location_Center_Y, colour =expression)) + geom_point(alpha=0.5) + scale_colour_viridis_c() + ggtitle("Patient MB-0132 - high Moran's I in Ki67")
b <- ggplot(low_meta, aes(x = Location_Center_X , y = Location_Center_Y, colour =expression)) + geom_point(alpha=0.5) + scale_colour_viridis_c() + ggtitle("Patient MB-0249 - low Moran's I in Ki67")
a+b
This metric measures the correlation of proteins/genes between a cell and its nearest neighbour cell.
plot_nncorrelation <- function(thissample , thisprotein){
sample_name <- thissample
thissample <- data_sce[, data_sce$metabricId == sample_name]
exprsMat <- logcounts(thissample)
cell_points_cts <- spatstat.geom::ppp(
x = as.numeric(thissample$Location_Center_X ), y = as.numeric(thissample$Location_Center_Y),
check = FALSE,
xrange = c(
min(as.numeric(thissample$Location_Center_X)),
max(as.numeric(thissample$Location_Center_X))
),
yrange = c(
min(as.numeric(thissample$Location_Center_Y)),
max(as.numeric(thissample$Location_Center_Y))
),
marks = t(as.matrix(exprsMat))
)
value <- spatstat.explore::nncorr(cell_points_cts)["correlation", ]
value <- value[ thisprotein]
# Find the indices of the two nearest neighbors for each cell
nn_indices <- nnwhich(cell_points_cts, k = 1)
protein <- thisprotein
df <- data.frame(thiscell_exprs = exprsMat[protein, ] , exprs = exprsMat[protein,nn_indices ])
p <- ggplot(df, aes( x =thiscell_exprs , y = exprs , colour = exprs )) +
geom_point(alpha = 0.3) + ggtitle(paste0( "Patient ", sample_name , " nn_corr = " , round(value, 2) )) + scale_colour_viridis_c()
return (p )
}
p1 <- plot_nncorrelation("MB-0605", "HER2")
p2 <- plot_nncorrelation("MB-0258", "HER2")
p1 + p2
The correlation differs between the 42RD patient (from cluster 1) and the 29RD patient (from cluster 2).
sessionInfo()
## R version 4.3.1 (2023-06-16)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Debian GNU/Linux 12 (bookworm)
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.21.so; LAPACK version 3.11.0
##
## locale:
## [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
## [4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
## [7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
## [10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Australia/Sydney
## tzcode source: system (glibc)
##
## attached base packages:
## [1] grid stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] scran_1.28.2 scater_1.28.0
## [3] scuttle_1.10.2 spatstat_3.0-6
## [5] spatstat.linnet_3.1-1 spatstat.model_3.2-6
## [7] rpart_4.1.19 spatstat.explore_3.2-3
## [9] nlme_3.1-162 spatstat.random_3.1-6
## [11] spatstat.geom_3.2-5 spatstat.data_3.0-1
## [13] survminer_0.4.9 ggpubr_0.6.0
## [15] tidyr_1.3.0 scattermore_1.2
## [17] plotly_4.10.2 limma_3.56.2
## [19] dplyr_1.1.3 spicyR_1.12.2
## [21] ggthemes_4.2.4 ClassifyR_3.5.21
## [23] survival_3.5-3 BiocParallel_1.34.2
## [25] MultiAssayExperiment_1.26.0 generics_0.1.3
## [27] scFeatures_0.99.27 ggplot2_3.4.3
## [29] SingleCellExperiment_1.22.0 SummarizedExperiment_1.30.2
## [31] Biobase_2.60.0 GenomicRanges_1.52.1
## [33] GenomeInfoDb_1.36.4 IRanges_2.34.1
## [35] S4Vectors_0.38.2 BiocGenerics_0.46.0
## [37] MatrixGenerics_1.12.3 matrixStats_1.0.0
##
## loaded via a namespace (and not attached):
## [1] SpatialExperiment_1.10.0 R.methodsS3_1.8.2
## [3] GSEABase_1.62.0 progress_1.2.2
## [5] EnsDb.Mmusculus.v79_2.99.0 goftest_1.2-3
## [7] Biostrings_2.68.1 HDF5Array_1.28.1
## [9] vctrs_0.6.3 digest_0.6.33
## [11] png_0.1-8 shape_1.4.6
## [13] EnsDb.Hsapiens.v79_2.99.0 ggrepel_0.9.3
## [15] deldir_1.0-9 parallelly_1.36.0
## [17] magick_2.8.0 MASS_7.3-58.3
## [19] reshape2_1.4.4 httpuv_1.6.11
## [21] foreach_1.5.2 withr_2.5.1
## [23] xfun_0.40 ellipsis_0.3.2
## [25] memoise_2.0.1 proxyC_0.3.3
## [27] ggbeeswarm_0.7.2 zoo_1.8-12
## [29] GlobalOptions_0.1.2 gtools_3.9.4
## [31] SingleCellSignalR_1.12.0 pbapply_1.7-2
## [33] R.oo_1.25.0 prettyunits_1.2.0
## [35] KEGGREST_1.40.1 promises_1.2.1
## [37] httr_1.4.7 rstatix_0.7.2
## [39] restfulr_0.0.15 globals_0.16.2
## [41] fitdistrplus_1.1-11 rhdf5filters_1.12.1
## [43] rhdf5_2.44.0 rstudioapi_0.15.0
## [45] miniUI_0.1.1.1 concaveman_1.1.0
## [47] babelgene_22.9 curl_5.1.0
## [49] zlibbioc_1.46.0 ScaledMatrix_1.8.1
## [51] polyclip_1.10-6 GenomeInfoDbData_1.2.10
## [53] xtable_1.8-4 stringr_1.5.0
## [55] evaluate_0.22 S4Arrays_1.0.6
## [57] BiocFileCache_2.8.0 hms_1.1.3
## [59] irlba_2.3.5.1 colorspace_2.1-0
## [61] filelock_1.0.2 ROCR_1.0-11
## [63] reticulate_1.32.0 magrittr_2.0.3
## [65] lmtest_0.9-40 viridis_0.6.4
## [67] later_1.3.1 lattice_0.20-45
## [69] future.apply_1.11.0 XML_3.99-0.14
## [71] cowplot_1.1.1 ggupset_0.3.0
## [73] RcppAnnoy_0.0.21 pillar_1.9.0
## [75] iterators_1.0.14 caTools_1.18.2
## [77] compiler_4.3.1 beachmat_2.16.0
## [79] stringi_1.7.12 tensor_1.5
## [81] minqa_1.2.5 GenomicAlignments_1.36.0
## [83] plyr_1.8.9 msigdbr_7.5.1
## [85] crayon_1.5.2 abind_1.4-5
## [87] BiocIO_1.10.0 locfit_1.5-9.8
## [89] sp_2.1-0 bit_4.0.5
## [91] codetools_0.2-19 BiocSingular_1.16.0
## [93] bslib_0.5.1 multtest_2.56.0
## [95] mime_0.12 splines_4.3.1
## [97] circlize_0.4.15 Rcpp_1.0.11
## [99] dbplyr_2.3.4 sparseMatrixStats_1.12.2
## [101] knitr_1.44 blob_1.2.4
## [103] utf8_1.2.3 AnnotationFilter_1.24.0
## [105] lme4_1.1-34 listenv_0.9.0
## [107] DelayedMatrixStats_1.22.6 GSVA_1.48.3
## [109] ggsignif_0.6.4 tibble_3.2.1
## [111] Matrix_1.6-1 scam_1.2-14
## [113] statmod_1.5.0 tweenr_2.0.2
## [115] pkgconfig_2.0.3 pheatmap_1.0.12
## [117] tools_4.3.1 cachem_1.0.8
## [119] RSQLite_2.3.1 viridisLite_0.4.2
## [121] DBI_1.1.3 numDeriv_2016.8-1.1
## [123] fastmap_1.1.1 rmarkdown_2.25
## [125] scales_1.2.1 ica_1.0-3
## [127] Seurat_4.4.0 Rsamtools_2.16.0
## [129] broom_1.0.5 sass_0.4.7
## [131] patchwork_1.1.3 BiocManager_1.30.22
## [133] graph_1.78.0 carData_3.0-5
## [135] RANN_2.6.1 farver_2.1.1
## [137] mgcv_1.8-42 yaml_2.3.7
## [139] rtracklayer_1.60.1 cli_3.6.1
## [141] purrr_1.0.2 leiden_0.4.3
## [143] lifecycle_1.0.3 uwot_0.1.16
## [145] bluster_1.10.0 backports_1.4.1
## [147] DropletUtils_1.20.0 annotate_1.78.0
## [149] gtable_0.3.4 rjson_0.2.21
## [151] ggridges_0.5.4 progressr_0.14.0
## [153] parallel_4.3.1 ape_5.7-1
## [155] jsonlite_1.8.7 edgeR_3.42.4
## [157] bitops_1.0-7 bit64_4.0.5
## [159] Rtsne_0.16 spatstat.utils_3.0-3
## [161] BiocNeighbors_1.18.0 SeuratObject_4.1.4
## [163] RcppParallel_5.1.7 jquerylib_0.1.4
## [165] metapod_1.8.0 dqrng_0.3.1
## [167] survMisc_0.5.6 R.utils_2.12.2
## [169] lazyeval_0.2.2 shiny_1.7.5
## [171] htmltools_0.5.6.1 KMsurv_0.1-5
## [173] sctransform_0.4.0 rappdirs_0.3.3
## [175] ensembldb_2.24.1 glue_1.6.2
## [177] XVector_0.40.0 RCurl_1.98-1.12
## [179] gridExtra_2.3 AUCell_1.22.0
## [181] boot_1.3-28.1 igraph_1.5.1
## [183] R6_2.5.1 gplots_3.1.3
## [185] labeling_0.4.3 km.ci_0.5-6
## [187] GenomicFeatures_1.52.2 cluster_2.1.4
## [189] Rhdf5lib_1.22.1 nloptr_2.0.3
## [191] vipor_0.4.5 DelayedArray_0.26.7
## [193] tidyselect_1.2.0 ProtGenerics_1.32.0
## [195] ggforce_0.4.1 xml2_1.3.5
## [197] car_3.1-2 AnnotationDbi_1.62.2
## [199] future_1.33.0 rsvd_1.0.5
## [201] munsell_0.5.0 KernSmooth_2.23-20
## [203] data.table_1.14.8 htmlwidgets_1.6.2
## [205] RColorBrewer_1.1-3 biomaRt_2.56.1
## [207] rlang_1.1.1 spatstat.sparse_3.0-2
## [209] lmerTest_3.1-3 fansi_1.0.5
## [211] beeswarm_0.4.0
The authors thank all their colleagues, particularly at The University of Sydney, Sydney Precision Data Science and Charles Perkins Centre for their support and intellectual engagement. Special thanks to Ellis Patrick, Shila Ghazanfar, Andy Tran for guiding and supporting the building of this workshop.